docs: update usage and resources #886
Conversation
- rewrites for clarity
- standardized references
- removed blockquotes
- added Docusaurus-style admonitions
Added some comments :)
Maybe it would be worth splitting the page into two?
- Usage & resources
- Optimization & tips?
```diff
  ## Requirements

- Actors built on top of the [Apify JS SDK](/sdk/js) and [Crawlee](https://crawlee.dev/) use autoscaling. This means that they will always run as efficiently as they can with the memory they have allocated. So, if you allocate 2 times more memory, the run should be 2 times faster and consume the same amount of compute units (1 * 1 = 0.5 * 2). Autoscaling for Python is not yet available, but it is planned for the near future.
+ Actors built with [Apify JS SDK](/sdk/js) and [Crawlee](https://crawlee.dev/) use autoscaling. This means that they will always run as efficiently as they can based on the allocated memory. So, if you double the allocated memory, the run should be twice as fast and consume the same amount of compute units (1 * 1 = 0.5 * 2).
```
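The compute-unit math in the hunk above (1 * 1 = 0.5 * 2) can be sketched as follows; the `computeUnits` helper is illustrative only, not part of any Apify API:

```javascript
// Sketch of the compute-unit (CU) arithmetic from the paragraph above:
// CU consumed = memory in GB × run duration in hours.
function computeUnits(memoryGb, hours) {
  return memoryGb * hours;
}

// Doubling the memory is expected to halve the runtime,
// so the CU cost of the run stays the same:
console.log(computeUnits(1, 1));   // 1 GB for 1 hour   -> 1 CU
console.log(computeUnits(2, 0.5)); // 2 GB for 0.5 hour -> 1 CU
```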
This should be checked with someone from delivery / tooling. Not sure if this is 100% true, as it slightly contradicts the Cheerio section later (with its mention of Node threads).
It feels correct. 4 GB memory = 1 CPU core, so the note about a max memory of 4 GB for Cheerio makes sense; the Node process will only use a single thread there. (Sidenote: that's how we have it implemented; we could try to leverage worker threads or other similar parallelization features of Node to get around this.)
cc @metalwarrior665 for field experience :]
- Yes, 4 GB is a target for non-browser Node.js actors because of the single core restriction.
- Also, Apify SDK doesn't use any autoscaling; there is nothing to scale there. That is only related to Crawlee.
So, as I understand it, the mention of the Apify JS SDK should be removed from this paragraph, as it applies only to Crawlee?
Yep.
- Actors using [Puppeteer](https://pptr.dev/) or [Playwright](https://playwright.dev/) for real web browser rendering require at least `1024MB` of memory.
- Large and complex sites like [Google Maps](https://apify.com/drobnikj/crawler-google-places) require at least `4096MB` for optimal speed and [concurrency](https://crawlee.dev/api/core/class/AutoscaledPool#minConcurrency).
- Projects involving large amounts of data in memory.
This would be better discussed with tooling / delivery.
I think this is fine. The idea is that for a browser Actor to start scaling concurrency, 1 GB is just not very useful, but it can work as a minimum.
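As a back-of-the-envelope illustration of why `1024MB` is only a bare minimum for browser Actors, the memory-to-concurrency trade-off can be sketched like this. All numbers below are assumptions for illustration, not Apify or Crawlee constants:

```javascript
// Illustrative only: estimating how many browser pages fit in allocated memory.
const MB_PER_BROWSER_PAGE = 200; // assumed average footprint of one open page
const BASE_OVERHEAD_MB = 512;    // assumed memory used by the Actor process itself

function estimateConcurrency(allocatedMb) {
  // At least one page must fit, however little memory remains.
  return Math.max(1, Math.floor((allocatedMb - BASE_OVERHEAD_MB) / MB_PER_BROWSER_PAGE));
}

console.log(estimateConcurrency(1024)); // 2  -> enough to run, not to scale
console.log(estimateConcurrency(4096)); // 17 -> room for real concurrency
```

Under these assumed numbers, quadrupling the memory raises the page budget roughly eightfold, which is why the larger allocation is where autoscaled concurrency becomes useful.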
```diff
  ### Maximum memory

- Apify Actors are most commonly written in [Node.js](https://nodejs.org/en/), which uses a [single process thread](https://betterprogramming.pub/is-node-js-really-single-threaded-7ea59bcc8d64). Unless you use external binaries such as the Chrome browser, Puppeteer, Playwright, or other multi-threaded libraries you will not gain more CPU power from assigning your Actor more than 4 GB of memory because Node.js cannot use more than 1 core.
+ Apify Actors are most commonly written in [Node.js](https://nodejs.org/en/), which uses a [single process thread](https://betterprogramming.pub/is-node-js-really-single-threaded-7ea59bcc8d64). Unless you use external binaries such as the Chrome browser, Puppeteer, Playwright, or other multi-threaded libraries you will not gain more CPU power from assigning your Actor more than `4096MB` of memory because Node.js cannot use more than 1 core.
```
I'd change the link to something else - it's quite old, and at the referenced site you have to create an account to read it.
I'd again discuss with tooling / delivery which articles would be good to refer to. (I also mean the next one on multiple threads.)
```diff
  When you run an Actor it generates platform usage that's charged to the user account. Platform usage comprises four main parts:

  - **Compute units**: CPU and memory resources consumed by the Actor.
- - **Data transfer**: Amount of data you transfered between web, Apify platform, and other external systems.
+ - **Data transfer**: The amount of data transferred between the web, Apify platform, and other external systems.
  - **Proxy costs**: Residential or SERP proxy usage.
- - **Storage operations**: Read, write, and other operations towards key-value store, dataset, and request queue.
+ - **Storage operations**: Read, write, and other operations performed on the Key-value store, Dataset, and Request queue.
```
I'd add some info on where to find the run usage - it's visible on the run detail and in the run list.
- changed the link to a site that doesn't require an account
- added new screenshots showing where users can find run usage
```diff
  ### Maximum memory

- Apify Actors are most commonly written in [Node.js](https://nodejs.org/en/), which uses a [single process thread](https://betterprogramming.pub/is-node-js-really-single-threaded-7ea59bcc8d64). Unless you use external binaries such as the Chrome browser, Puppeteer, Playwright, or other multi-threaded libraries you will not gain more CPU power from assigning your Actor more than 4 GB of memory because Node.js cannot use more than 1 core.
+ Apify Actors are most commonly written in [Node.js](https://nodejs.org/en/), which uses a [single process thread](https://dev.to/arealesramirez/is-node-js-single-threaded-or-multi-threaded-and-why-ab1). Unless you use external binaries such as the Chrome browser, Puppeteer, Playwright, or other multi-threaded libraries you will not gain more CPU power from assigning your Actor more than `4096MB` of memory because Node.js cannot use more than 1 core.
```
I believe it's "single thread process", not "single process thread".
```diff
- > It is possible to [use multiple threads in Node.js-based Actor](https://dev.to/reevranj/multiple-threads-in-nodejs-how-and-what-s-new-b23) with some configuration. This can be useful if you need to offload a part of your workload.
+ It's possible to [use multiple threads in Node.js-based Actor](https://dev.to/reevranj/multiple-threads-in-nodejs-how-and-what-s-new-b23) with some configuration. This can be useful if you need to offload a part of your workload.
```
We also have a guide directly for Crawlee: https://crawlee.dev/docs/3.7/guides/parallel-scraping
Later we can add a horizontal scaling guide, especially since we now have RequestQueueV2, which supports multiple Actors simultaneously.